[pull] main from triggerdotdev:main#215

Merged
pull[bot] merged 7 commits into Dustin4444:main from triggerdotdev:main
Jun 11, 2026
@pull pull Bot commented Jun 11, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

myftija and others added 7 commits June 11, 2026 18:29
…workload manager (#3902)

`ComputeWorkloadManager.create` currently swallows gateway errors, so a
cold start that fails placement (e.g. a netns slot with a busy tap, or a
full node disk) is silently abandoned: the dequeued run sits until the
run engine's `PENDING_EXECUTING` heartbeat timeout redrives it via stall
detection.

### Changes

- Retry `instances.create` with short backoff (default 3 attempts, 250ms
backoff), recording `createAttempts` in the wide event.
- **Only statuses where the create definitely did not commit are
retried**: 500 (agent/fcrun create failed) and 503 (no placement).
502/504 are excluded because the gateway emits those when it fails to
reach the node or read its response, which can happen *after* the agent
committed the create. The gateway only records the instance name on a
clean 201, so a same-name retry would miss the collision check and could
double-create the VM on another node. Network-level fetch failures are
retried (if the gateway processed the create, its name index is
populated and the retry 409s harmlessly). Timeouts are not retried.
- **Retry attempts after a 5xx use a deterministic `-rN` name suffix**:
a failed create can leave its name registered until async cleanup runs,
so a retry under the same name could collide with it. Attempt 1 keeps
the unsuffixed name.
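
The retry rules above can be sketched roughly as follows. This is a minimal illustration, not the actual `ComputeWorkloadManager` code: the `CreateResult` and `CreateFn` shapes, the `createWithRetry` helper, and the injected `sleep` are all hypothetical, and the sketch assumes that `N` in `-rN` is the attempt number and that a retry after a network failure reuses the current name.

```typescript
// Hypothetical result shape; the real gateway client is not shown in the source.
type CreateResult =
  | { ok: true }
  | { ok: false; kind: "status"; status: number }
  | { ok: false; kind: "network" }
  | { ok: false; kind: "timeout" };

type CreateFn = (name: string) => Promise<CreateResult>;

// Only statuses where the create definitely did not commit.
const RETRYABLE_STATUSES = new Set([500, 503]);

function isRetryable(result: CreateResult): boolean {
  if (result.ok) return false;
  if (result.kind === "network") return true; // a committed create makes the retry 409
  if (result.kind === "timeout") return false;
  return RETRYABLE_STATUSES.has(result.status); // 502/504 are deliberately excluded
}

async function createWithRetry(
  baseName: string,
  create: CreateFn,
  opts: { maxAttempts?: number; backoffMs?: number; sleep?: (ms: number) => Promise<void> } = {}
): Promise<{ result: CreateResult; createAttempts: number; names: string[] }> {
  const maxAttempts = opts.maxAttempts ?? 3;
  const backoffMs = opts.backoffMs ?? 250;
  const sleep = opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));

  const names: string[] = [];
  let name = baseName; // attempt 1 keeps the unsuffixed name
  let result: CreateResult = { ok: false, kind: "network" };

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    names.push(name);
    result = await create(name);
    if (!isRetryable(result) || attempt === maxAttempts) {
      return { result, createAttempts: attempt, names };
    }
    // After a 5xx the failed name may still be registered, so the next
    // attempt switches to a deterministic suffixed name (assumed: N = attempt number).
    if (!result.ok && result.kind === "status") {
      name = `${baseName}-r${attempt + 1}`;
    }
    await sleep(backoffMs);
  }
  return { result, createAttempts: maxAttempts, names };
}
```

Injecting `sleep` keeps the backoff testable without real timers; `createAttempts` in the return value mirrors the field recorded in the wide event.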
…tes or disconnects (#3894)

The compute suspend flow delays snapshots by `snapshotDelayMs` (~30s) so
short-lived waitpoints skip the snapshot entirely. The intent was that a
run continuing before the delay expires cancels the pending snapshot.
But the only `cancel()` call site was the `/continue` action, which
runners only invoke when restoring from an already-taken snapshot, so
pending snapshots were never cancelled (zero `snapshot.canceled` events
ever emitted in prod). When a run resumed and completed inside the
window, the stale snapshot fired ~30s later anyway, pausing the VM for
6–13s in the middle of a warm-start long-poll. The frozen guest couldn't
fire its abort timer or send a FIN, causing stalls and
run-engine-driven retries.

### Changes

- Cancel the pending snapshot on `attempt.complete` — after the platform
accepts the completion, before the HTTP reply (so it can't reorder with
the runner's next `/suspend`).
- Cancel on `runDisconnected` (crash, exit, or run replaced on the
socket).
- Both cancels are guarded by a runnerId match (new
`TimerWheel.peek()`): a stale duplicate runner for a reassigned run must
not cancel the fresh runner's pending snapshot. A missing runnerId falls
through to an unconditional cancel (the pre-existing `/continue`
behavior is unchanged).
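
The runnerId guard described in the last bullet might look roughly like this. It is a sketch under stated assumptions: `PendingSnapshots` is a hypothetical in-memory stand-in for the timer wheel (only `peek()` and `cancel()` are modeled), and `cancelIfOwner` is an illustrative name, not a function from the codebase.

```typescript
// Hypothetical stand-in for the timer wheel; the real TimerWheel also handles delays.
type PendingSnapshot = { runnerId: string };

class PendingSnapshots {
  private pending = new Map<string, PendingSnapshot>();

  schedule(runId: string, entry: PendingSnapshot): void {
    this.pending.set(runId, entry);
  }

  // Read the pending entry without removing it.
  peek(runId: string): PendingSnapshot | undefined {
    return this.pending.get(runId);
  }

  // Remove the pending entry; returns true if one existed.
  cancel(runId: string): boolean {
    return this.pending.delete(runId);
  }
}

function cancelIfOwner(wheel: PendingSnapshots, runId: string, runnerId?: string): boolean {
  const entry = wheel.peek(runId);
  if (!entry) return false; // nothing pending
  // A missing runnerId falls through to an unconditional cancel.
  if (runnerId === undefined) return wheel.cancel(runId);
  // A stale duplicate runner must not cancel the fresh runner's snapshot.
  if (entry.runnerId !== runnerId) return false;
  return wheel.cancel(runId);
}
```

Refusing the cancel on a mismatch is the fail-safe direction: the worst case is a snapshot that still fires, which matches the pre-change behavior.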

Waitpoint suspensions keep the runner socket connected and the attempt
incomplete, so neither hook touches a snapshot that is still wanted.

Known limitation (fail-safe direction): `socket.data.runnerId` is frozen
at the websocket handshake, so after a same-supervisor restore the
disconnect-path guard refuses the cancel. The `attempt.complete` path
uses the runner's current header id and is unaffected.
## Summary

HIPAA BAA is offered as a paid add-on on every paid plan. Each paid tier
on the in-app pricing card now has a "HIPAA BAA add-on" row with a
"Request a BAA" link that opens the existing contact dialog pre-filled
with a new `hipaa` inquiry type, prompting the user for their company
name and a brief description of the PHI workload.

The contact form's `feedbackTypes` are restructured to match the
marketing /contact form: every inquiry type carries a Plain label ID and
a "Contact form: ..." thread title, so threads land in Plain identically
whether they come from the dashboard or the marketing site. The
included-compute line on each tier also picks up the credits wording
from the marketing pricing page, and the Enterprise tier lifts its title
above the features row.
…latency (#3907)

## Summary

Three related fixes for `chat.headStart` and continuation boots, found
while investigating customer reports.

**1. `chat.headStart` now works with `hydrateMessages`.** The turn-0
handover splice only ran on the default accumulation path, so agents
registering `hydrateMessages` silently lost the warm route's step-1
response: pure-text turns fired `onTurnComplete` with no assistant
message (and an empty durable write), tool-call turns re-ran step 1 from
scratch under a fresh `messageId`, and the head-start user message never
reached the hydrate hook at all. The first-turn history now reaches
`hydrateMessages` as `incomingMessages`, and the splice runs after both
accumulation branches, deduplicated by the handover `messageId`.
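
As an illustration of the fixed ordering, the sketch below runs the splice after either accumulation branch and deduplicates by message id. The `ChatMessage` shape, the synchronous `hydrateMessages` signature, and both function names are simplifying assumptions; the SDK's real types and hook are richer.

```typescript
// Hypothetical message shape for illustration only.
type ChatMessage = { id: string; role: "user" | "assistant"; text: string };

type HydrateHook = (args: { incomingMessages: ChatMessage[] }) => ChatMessage[];

// Append the handover's partial assistant message unless the history
// already contains a message with the same id.
function spliceHandover(history: ChatMessage[], handover?: ChatMessage): ChatMessage[] {
  if (!handover) return history;
  const alreadyPresent = history.some((m) => m.id === handover.id);
  return alreadyPresent ? history : [...history, handover];
}

function accumulateFirstTurn(
  incoming: ChatMessage[],
  handover?: ChatMessage,
  hydrateMessages?: HydrateHook
): ChatMessage[] {
  // Branch 1: the hydrate hook receives the first-turn history as incomingMessages.
  // Branch 2: default accumulation uses the incoming messages directly.
  const history = hydrateMessages ? hydrateMessages({ incomingMessages: incoming }) : incoming;
  // The splice runs after both branches, so neither path loses the step-1 response.
  return spliceHandover(history, handover);
}
```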

**2. Reasoning parts survive the handover.** The synthesized partial
only mapped text and tool-call parts, so an extended-thinking model's
step-1 reasoning streamed to the browser but never reached durable
history. Reasoning parts now map through with provider metadata, so
Anthropic thinking signatures survive a UIMessage round trip on hydrate
replays.
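
A rough sketch of the mapping change, assuming simplified part shapes (`StepPart` and `toHistoryParts` are hypothetical and only loosely modeled on streamed message parts): the reasoning branch is carried through together with its provider metadata instead of being dropped.

```typescript
// Hypothetical part shapes; the real UIMessage part types carry more fields.
type StepPart =
  | { type: "text"; text: string }
  | { type: "tool-call"; toolCallId: string; toolName: string; input: unknown }
  | { type: "reasoning"; text: string; providerMetadata?: Record<string, unknown> };

function toHistoryParts(parts: StepPart[]): StepPart[] {
  return parts.map((part): StepPart => {
    switch (part.type) {
      case "text":
        return { type: "text", text: part.text };
      case "tool-call":
        return {
          type: "tool-call",
          toolCallId: part.toolCallId,
          toolName: part.toolName,
          input: part.input,
        };
      case "reasoning":
        // Keep provider metadata so signed reasoning can be replayed to the provider.
        return part.providerMetadata === undefined
          ? { type: "reasoning", text: part.text }
          : { type: "reasoning", text: part.text, providerMetadata: { ...part.providerMetadata } };
    }
  });
}
```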

**3. Continuation boots no longer stall for ~10 seconds.** The `.in`
resume cursor was found by draining an SSE subscription that only closes
after its 5 second inactivity window, and the scan ran twice per boot.
The lookup is now a non-blocking records read of the latest
turn-complete header and runs at most once per boot. The boot reads run
concurrently, and chat snapshots carry the cursor so subsequent boots
skip the scan entirely. Measured locally on a cancel-then-continue
repro: pre-turn continuation latency dropped from ~11s to ~0.5s.
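
The boot sequence described here can be sketched as below. All names (`BootDeps`, `bootContinuation`, `readHistory`, `nextSnapshot`) are hypothetical; the sketch only illustrates three ideas from the paragraph: prefer the cursor carried by the snapshot, otherwise do a single records read, and run the remaining boot reads concurrently.

```typescript
// Hypothetical shapes for illustration; not the SDK's real types.
type ChatSnapshot = { resumeCursor?: string };

type BootDeps = {
  readSnapshot: () => Promise<ChatSnapshot | undefined>;
  // Stands in for the non-blocking read of the latest turn-complete header.
  readLatestTurnCompleteCursor: () => Promise<string | undefined>;
  readHistory: () => Promise<string[]>;
};

async function bootContinuation(deps: BootDeps) {
  const snapshot = await deps.readSnapshot();

  // A snapshot that carries the cursor lets this boot skip the scan entirely;
  // otherwise the records read is issued exactly once.
  const cursorRead: Promise<string | undefined> =
    snapshot?.resumeCursor !== undefined
      ? Promise.resolve(snapshot.resumeCursor)
      : deps.readLatestTurnCompleteCursor();

  // The cursor lookup and the other boot reads run concurrently.
  const [cursor, history] = await Promise.all([cursorRead, deps.readHistory()]);

  // Persist the cursor so the next boot does not need to scan.
  const nextSnapshot: ChatSnapshot = cursor === undefined ? {} : { resumeCursor: cursor };
  return { cursor, history, nextSnapshot };
}
```

Snapshots written before the field existed simply have no `resumeCursor`, so they take the records-read branch, which matches the fallback described under Rollout.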

Every fix was verified red-green: new unit tests reproduced each failure
before the fix, and end-to-end smoke tests against a live local stack
covered both handover legs, reasoning persistence with extended thinking
(including a follow-up turn that round-trips the persisted signed
reasoning back to the provider), and the boot timing comparison.

## Rollout

SDK-only; no server change required. A new SDK against a server that
does not serialize record headers degrades to the existing no-cursor
fallback. Old SDKs ignore the new snapshot field, and new SDKs fall back
to the records scan on snapshots written before it existed.
…controls (#3906)

## Summary

The run trace page loader serialized every span's raw OTel events (with
full properties) into the response, even though the tree UI only renders
the derived `timelineEvents` and the span detail panel refetches what it
needs. On event-heavy traces that inflated both the loader payload and
the server-side heap copies built per request. This PR keeps raw span
events server-side and pairs that with a few related trace-view
improvements:

- A new optional `TRACE_VIEW_EMERGENCY_SPAN_CAP` env var (unset by
default) clamps the trace summary and detailed trace summary span limits
on both event store paths, including the public run trace endpoint, so
operators can bound trace query sizes in one place without retuning the
per-store limits.
- The TreeView virtualizer resolved every rendered row with a linear
scan over the whole tree (and `getNodeProps` did the same via
`findIndex`); rows now resolve through memoized id lookup maps, which
matters once traces reach tens of thousands of spans.
- The run stream SSE lookup now applies the same organization membership
scoping as the rest of the run page presenters, for consistency.
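
To illustrate the TreeView change in the second bullet, here is a minimal sketch of replacing a per-row linear scan with a memoized id lookup map. `FlatNode`, `indexById`, and `findNodeIndex` are hypothetical names, and the `WeakMap` cache stands in for whatever memoization the component actually uses.

```typescript
// Hypothetical flattened tree row; the real TreeView node type has more fields.
type FlatNode = { id: string; parentId?: string };

// Cache one id -> index map per node array, so it is built once per tree
// rather than once per rendered row.
const indexCache = new WeakMap<readonly FlatNode[], Map<string, number>>();

function indexById(nodes: readonly FlatNode[]): Map<string, number> {
  let index = indexCache.get(nodes);
  if (!index) {
    index = new Map<string, number>();
    for (let i = 0; i < nodes.length; i++) {
      index.set(nodes[i].id, i);
    }
    indexCache.set(nodes, index);
  }
  return index;
}

// O(1) replacement for nodes.findIndex((n) => n.id === id).
function findNodeIndex(nodes: readonly FlatNode[], id: string): number {
  return indexById(nodes).get(id) ?? -1;
}
```

With tens of thousands of spans, a `findIndex` call per rendered row makes each render pass proportional to rows times tree size, while the map costs one linear pass per tree.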

Behavior is unchanged by default: the trace tree renders from the same
`timelineEvents` it always has, and the new cap only takes effect when
set.
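
The opt-in clamp could be expressed along these lines. This is a sketch, not the server's code: `parseSpanCap` and `effectiveSpanLimit` are hypothetical helpers, the environment is passed in as a plain record for testability, and treating invalid or non-positive values as "unset" is an assumption.

```typescript
// Parse the optional cap; undefined means "no cap configured".
function parseSpanCap(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const value = Number(raw);
  // Assumption: anything that is not a positive integer is ignored.
  return Number.isInteger(value) && value > 0 ? value : undefined;
}

// Clamp a per-store span limit by the emergency cap when it is set.
function effectiveSpanLimit(
  storeLimit: number,
  env: Record<string, string | undefined>
): number {
  const cap = parseSpanCap(env["TRACE_VIEW_EMERGENCY_SPAN_CAP"]);
  return cap === undefined ? storeLimit : Math.min(storeLimit, cap);
}
```

Because the cap only ever lowers a limit, leaving the variable unset keeps every existing per-store limit as it was.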
…sts (#3912)

## Summary

`test/runsRepositoryCursor.test.ts` pinned its fixture runs to
`createdAt = 2026-06-04T16:55:07Z`. `listRuns` applies the default 7-day
window when no time filter is given, so the fixtures aged out of the
window at 16:55 UTC on 2026-06-11 and all five tests started failing on
every branch, regardless of what the branch changed. The tests were
green on their own CI two days earlier because the fixtures were only
five days old at the time.

This switches the fixture base to a relative timestamp (one hour ago),
so the fixtures stay inside the default window permanently. Verified the
suite goes 5/5 green with this change on the same environment where the
pinned dates fail 5/5.
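
The failure mode and the fix can be reproduced with a few lines of date arithmetic. The helper names below are hypothetical and the window check is a simplified model of the default filter, not the actual `listRuns` implementation.

```typescript
const HOUR_MS = 60 * 60 * 1000;
const DEFAULT_WINDOW_MS = 7 * 24 * HOUR_MS; // the default 7-day window

// Simplified model of "no time filter given": keep runs created in the last 7 days.
function inDefaultWindow(createdAt: Date, now: Date): boolean {
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs >= 0 && ageMs < DEFAULT_WINDOW_MS;
}

// Old fixture base: an absolute timestamp that eventually ages out.
const pinnedBase = new Date("2026-06-04T16:55:07Z");

// New fixture base: always one hour before the test runs.
function relativeBase(now: Date): Date {
  return new Date(now.getTime() - HOUR_MS);
}
```

The pinned base passes on 2026-06-09 and fails from the afternoon of 2026-06-11 onward, while the relative base is inside the window for any value of `now`.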
## Summary

The environment variables page loaded every variable value in the
project, unfiltered by environment. Archiving a preview branch does not
delete its environment variable value rows, so projects that churn
preview branches accumulate values forever, and every page view loaded
all of them. On large projects this made the page loader take many
seconds and stalled the server while deserializing the oversized result.

## Fix

The presenter now loads the displayed environments first and filters the
`values` relation to those environment IDs. That matches the display
semantics exactly (per-user dev environments and active branch
environments included), and the lookup is covered by the existing unique
index on `(variableId, environmentId)`. Values in archived branch
environments are no longer fetched at all.
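
In outline, the fix resolves the displayed environments first and then restricts values to those IDs. The sketch below models that in memory with hypothetical `Environment` and `VariableValue` shapes; the real presenter applies the equivalent filter in its database query, and the rule that decides which environments are displayed is simplified here to "not archived".

```typescript
// Hypothetical shapes for illustration only.
type Environment = { id: string; slug: string; archivedAt?: Date };
type VariableValue = { variableId: string; environmentId: string; value: string };

// Step 1: decide which environments the page displays (simplified rule).
function displayedEnvironments(environments: Environment[]): Environment[] {
  return environments.filter((env) => env.archivedAt === undefined);
}

// Step 2: keep only values that belong to a displayed environment.
function valuesForDisplay(
  environments: Environment[],
  values: VariableValue[]
): VariableValue[] {
  const displayedIds = new Set(displayedEnvironments(environments).map((env) => env.id));
  return values.filter((v) => displayedIds.has(v.environmentId));
}
```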

Covered by a new testcontainers test asserting that values from active
environments (including branch environments) are returned while archived
branch environments are excluded.
@pull pull Bot locked and limited conversation to collaborators Jun 11, 2026
@pull pull Bot added the ⤵️ pull label Jun 11, 2026
@pull pull Bot merged commit 8dc77c0 into Dustin4444:main Jun 11, 2026
0 of 5 checks passed